This is an update to my previous take on Bayesian AB testing. Specifically, I looked at taking advantage of prior knowledge that the change in conversion rate shouldn't be large -- maybe because you've run dozens of similar tests and you're making only a small modification to the website. Previously I described a sort of multi-level beta-binomial model to account for that prior knowledge, but it turns out that regularized logistic regression is simpler and gives pretty much the same answer.

The data-generating model

We start with a baseline conversion rate $\mu$, from an uninformative prior*. Then, based on your prior knowledge of how likely the conversion rates will differ from one another, select a spread term $\sigma$. Next the two conversion rates $p_A$ and $p_B$ are drawn from $\text{logistic} \left[ \mathcal{N}(\mu, \sigma) \right]$. We then observe visits to the website which are Bernoulli distributed with rates $p_A$ and $p_B$.

* It's simple enough to use prior information for this too, but it turns out to have very little effect on the answer.

TODO: a little bit on how this model is just regularized logistic regression


You can solve for the maximum-likelihood estimates of $p_A$ and $p_B$ directly, but it's nice to have a full distribution to work with. For example, you might want to report various credible intervals of interest. To get the distributions, I'll use Stan.


In [102]:
import pandas as pd
import matplotlib.pyplot as plt
from numpy import exp
from pystan import StanModel

# pystan gives some annoying warnings when sampling
import warnings
warnings.filterwarnings('ignore')

logistic = lambda x: 1/(1+exp(-x))

The Stan code for the 'naive', usual beta-binomial model:


In [115]:
naive_model_code = '''
data {
    int<lower=0> successes[2];
    int<lower=0> trials[2];
    real<lower=0> alpha0;
    real<lower=0> beta0;
}
parameters {
    vector[2] p;
}
model {
    p ~ beta(alpha0, beta0);
    successes ~ binomial(trials, p);
}
'''
naive_model = StanModel(model_code=naive_model_code)

The Stan code for the improved beta-binomial model from the previous notebook:


In [103]:
bb_model_code = '''
data {
    int<lower=0> successes[2];
    int<lower=0> trials[2];
    real<lower=0> alpha0;
    real<lower=0> beta0;
    real<lower=0> tau;
}
parameters {
    real mu;
    vector[2] p;
}
transformed parameters {
    real<lower=0> alpha1;
    real<lower=0> beta1;
    alpha1 <- tau*mu;
    beta1 <- tau*(1-mu);
}
model {
    p ~ beta(alpha1, beta1);
    successes ~ binomial(trials, p);
}
'''
bb_model = StanModel(model_code=bb_model_code)

The Stan code for the new logistic regression model:


In [104]:
lr_model_code = '''
data {
    int<lower=0> successes[2];
    int<lower=0> trials[2];
    real<lower=0> sigma;
}
parameters {
    real mu;
    vector[2] coef;
}
model {
    coef ~ normal(mu, sigma);
    successes ~ binomial_logit(trials, coef);
}
'''
lr_model = StanModel(model_code=lr_model_code)

Example

Let's look at the same example as the previous notebook. Check the fit summary for each model to make sure things converged and behaved as expected.


In [108]:
successesA = 300
successesB = 350
failuresA = 3000
failuresB = 3000

successes = [successesA, successesB]
trials = [successesA+failuresA, successesB+failuresB]

pA, pB = successesA/(successesA+failuresA), successesB/(successesB+failuresB)
pA, pB, (pB-pA)/pA


Out[108]:
(0.09090909090909091, 0.1044776119402985, 0.1492537313432835)

In [122]:
stan_data = dict(successes=successes, trials=trials, alpha0=1, beta0=1)
naive_fit1 = naive_model.sampling(data=stan_data)
naive_fit1


Out[122]:
Inference for Stan model: anon_model_6d413c36495b1f1de92fa8ddc3024cf0.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p[0]   0.09  4.7e-4 4.8e-3   0.08   0.09   0.09   0.09    0.1    105   1.02
p[1]    0.1  3.6e-4 5.2e-3   0.09    0.1    0.1   0.11   0.11    211   1.01
lp__  -2127    0.05   0.93  -2130  -2128  -2127  -2127  -2126    381   1.01

Samples were drawn using NUTS(diag_e) at Sun Feb  1 17:13:34 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

In [123]:
stan_data = dict(successes=successes, trials=trials, alpha0=30, beta0=270)
naive_fit2 = naive_model.sampling(data=stan_data)
naive_fit2


Out[123]:
Inference for Stan model: anon_model_6d413c36495b1f1de92fa8ddc3024cf0.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p[0]   0.09  3.8e-4 5.0e-3   0.08   0.09   0.09   0.09    0.1    171   1.02
p[1]    0.1  3.3e-4 5.0e-3   0.09    0.1    0.1   0.11   0.11    230   1.01
lp__  -2318    0.06   1.05  -2320  -2318  -2317  -2317  -2317    286   1.01

Samples were drawn using NUTS(diag_e) at Sun Feb  1 17:14:00 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

In [112]:
stan_data = dict(successes=successes, trials=trials, alpha0=3, beta0=27, tau=1500)
bb_fit = bb_model.sampling(data=stan_data)
bb_fit


Out[112]:
Inference for Stan model: anon_model_472b220940aa6b1a70571ed176fc238a.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu        0.1  2.2e-4 6.7e-3   0.09   0.09    0.1    0.1   0.11    904    1.0
p[0]     0.09  1.5e-4 4.7e-3   0.08   0.09   0.09    0.1    0.1    928    1.0
p[1]      0.1  1.6e-4 4.8e-3   0.09    0.1    0.1   0.11   0.11    947    1.0
alpha1 146.84    0.33  10.06 127.52 139.91 146.74 153.69 166.38    904    1.0
beta1  1353.1    0.33  10.06 1333.6 1346.3 1353.2 1360.0 1372.4    904    1.0
lp__    -2121    0.05   1.29  -2124  -2121  -2120  -2120  -2119    800    1.0

Samples were drawn using NUTS(diag_e) at Sun Feb  1 17:03:37 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

In [110]:
stan_data = dict(successes=successes, trials=trials, sigma=.09)
lr_fit = lr_model.sampling(data=stan_data)
lr_fit


Out[110]:
Inference for Stan model: anon_model_f81458e5eafcf826b0b5d753bc432a22.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu       -2.22  2.4e-3   0.07  -2.37  -2.27  -2.22  -2.18  -2.08    962    1.0
coef[0]  -2.28  1.8e-3   0.05  -2.39  -2.32  -2.28  -2.24  -2.17    930    1.0
coef[1]  -2.17  1.6e-3   0.05  -2.27   -2.2  -2.17  -2.14  -2.07   1006    1.0
lp__     -2128    0.04   1.24  -2132  -2129  -2128  -2127  -2127    876    1.0

Samples were drawn using NUTS(diag_e) at Sun Feb  1 17:02:17 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Combine all the estimates into one dataframe for plotting:


In [124]:
p = pd.DataFrame(naive_fit1.extract()['p'], columns=['A', 'B'])
naive_data1 = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
naive_data1['variable'] = 'naive flat prior'

p = pd.DataFrame(naive_fit2.extract()['p'], columns=['A', 'B'])
naive_data2 = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
naive_data2['variable'] = 'naive informative prior'

p = pd.DataFrame(bb_fit.extract()['p'], columns=['A', 'B'])
bb_data = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
bb_data['variable'] = 'improved beta-binomial'

p = pd.DataFrame(logistic(lr_fit.extract()['coef']), columns=['A', 'B'])
lr_data = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
lr_data['variable'] = 'logistic regression'

data = pd.concat([naive_data1, naive_data2, bb_data, lr_data])

Summary

The new, simpler logistic regression model gives basically the same answer as the previous beta-binomial model.


In [125]:
data.groupby('variable').value.plot(kind='kde')
plt.legend()
plt.xlim(-.1, .4);